# Analyze topics over different time periods
library(topicmodels)
library(tm)
library(ldatuning)
# Preprocess the corpus
corpus <- Corpus(VectorSource(documents))
dtm <- DocumentTermMatrix(corpus)
# Fit LDA model
lda_model <- LDA(dtm, k = num_topics)Hypothetical model
Modeling strategy
Which model could help me detect the change over time of a certain concept in a corpora of company documents
To detect the change over time of a certain concept in a corpus of company documents, you can use a combination of Topic Modeling, Word Embeddings, and Temporal Analysis techniques. Here are some models and methods that can help with this task:
1. Dynamic Topic Models (DTM)
- Dynamic Topic Models extend traditional topic modeling approaches like Latent Dirichlet Allocation (LDA) to account for changes in topics over time.
- DTM models the evolution of topics in a time series, allowing you to see how the prominence of certain topics or concepts shifts over different time periods.
- This is useful if you want to track how the discussion around a concept evolves within your document corpus.
2. Temporal Word Embeddings
- Temporal Word Embeddings are extensions of static word embeddings like Word2Vec or GloVe that capture how the meaning of words changes over time.
- By training word embeddings separately for different time periods (e.g., by splitting your corpus into time segments), you can analyze how the vector representation of a concept shifts over time.
- Models like Temporal Word2Vec or Dynamic Word Embeddings can help identify changes in how a concept is discussed or understood over time.
library(wordVectors)
# Train word2vec model for each time period
model_t1 <- train_word2vec("text_time_period_1.txt", output_file = "vec_t1.bin")
model_t2 <- train_word2vec("text_time_period_2.txt", output_file = "vec_t2.bin")
# Compare word embeddings over time3. BERT with Temporal Fine-tuning
- BERT (Bidirectional Encoder Representations from Transformers) can be fine-tuned on your document corpus, with separate fine-tuning stages for different time periods.
- By comparing the contextual embeddings of the concept across different time periods, you can analyze how the context and usage of the concept have changed over time.
- Alternatively, you can use a time-aware variant like BERTime, which is designed to capture temporal dynamics in textual data.
normally done with python
library(bertR)
# Load pre-trained BERT model and fine-tune on time-specific data
model <- bert_load("bert-base-uncased")
fine_tuned_model <- bert_finetune(model, train_data)
# Generate embeddings and analyze changes over time4. Sentence Transformers with Time-Based Clustering
- Use Sentence Transformers (e.g., SBERT) to generate embeddings for sentences or paragraphs discussing the concept.
- Apply clustering algorithms (with packages like
clusterorfactoextra) to these embeddings over different time periods to see how the clustering of topics around the concept changes. - This approach is effective for tracking nuanced changes in how the concept is discussed at different points in time.
5. Sequential Neural Networks for Temporal Sequence Prediction
- Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks can be used to model sequences of documents or sentences over time.
- By training these models to predict future text sequences, you can analyze how the patterns associated with a concept evolve, which can help you detect changes in its importance or context.
library(keras)
library(tensorflow)
# Define and train LSTM model
model <- keras_model_sequential() %>%
layer_lstm(units = 50, return_sequences = TRUE, input_shape = c(time_steps, features)) %>%
layer_dense(units = 1)
model %>% compile(loss = 'mse', optimizer = 'adam')
history <- model %>% fit(x_train, y_train, epochs = 20, batch_size = 32)6. Change Point Detection Algorithms
- Integrating change point detection algorithms (e.g., Bayesian change point detection) with NLP techniques allows you to pinpoint when significant shifts in the usage or context of a concept occur.
- This is particularly useful if you’re interested in identifying specific events or periods that caused a shift in how a concept is discussed.
library(changepoint)
library(bcp)
# Apply change point detection to time series of concept scores
cpt <- cpt.meanvar(your_time_series_data)
plot(cpt)Workflow Example:
Data Collection: The researchers collected a large corpus of CSR reports from Fortune 500 companies spanning multiple years (e.g., from the 1990s to the 2010s).
Preprocessing: The text from these reports was cleaned, tokenized, and prepared for analysis. Segment your document corpus by time (e.g., by year or quarter).
-
Modeling:
- Apply Dynamic Topic Models, e.g. Latent Dirichlet Allocation (LDA) to identify key topics discussed in the reports. By applying LDA to different time segments (e.g., reports from the 1990s, 2000s, 2010s), you could track the prominence of each topic over time.
- or train Temporal Word Embeddings for each time segment.
- Alternatively, fine-tune BERT or Sentence Transformers for each time segment.
ES (Zhang, Kim, and Xing 2015) This study utilizes
Dynamic Topic Models (DTM)to analyze and track changes in market competition by analyzing text data from financial news articles and company reports. The focus is on understanding how discussions about companies and markets evolve over time and how these changes correlate with stock market returns.
-
Temporal Analysis:
- Track the changes in topic distributions or embedding vectors associated with your concept.
- Use clustering, cosine similarity, or other distance metrics to quantify the change over time.
- Apply change point detection to identify periods of significant shift.
-
Word Embeddings:
- Word2Vec models were trained on CSR reports from different time periods to observe how the semantic meaning of key terms (e.g., “sustainability,” “diversity”) evolved.
- By comparing the embeddings of these terms over time, they assessed shifts in how these concepts were discussed.
- Change Detection:
- The study utilized statistical methods to identify significant change points in the emphasis on particular CSR themes. This helped pinpoint specific events or external pressures (like new regulations) that may have driven changes in corporate communication.
By using these models, you can systematically detect and analyze how the concept of interest changes over time in your corpus of company documents.